<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head>
<title>org.archive.io.warc package</title>
</head>
<body>
Experimental WARC Writer and Readers.  Code and specification subject to change
with no guarantees of backward compatibility: i.e. newer readers
may not be able to parse WARCs written with older writers. 

This code, with noted exceptions, is a loose implementation of parts of the
(unreleased and unfinished)
<a href="http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html">WARC
File Format (Version 0.9)</a>. Deviations from 0.9, outlined below in the
section <i>Deviations from Spec.</i>, are to be proposed as amendments to the
specification to make a new revision.  Since the new spec. revision will likely
be named version 0.10, code in this package writes WARCs of version 0.10 -- not
0.9.


<h2>Implementation Notes</h2>
<h3>Tools</h3>
<p>Initial implementations of <code>Arc2Warc</code> and <code>Warc2Arc</code>
tools can be found in the package above this one, at
{@link org.archive.io.Arc2Warc} and {@link org.archive.io.Warc2Arc}
respectively.  Pass <code>--help</code> to learn how to use each tool.
</p>

<h3>Unique ID Generator</h3>
<p>WARC requires a GUID for each record written. A configurable unique ID
{@link org.archive.uid.GeneratorFactory}, it can be configured to use alternate
unique ID generators, was added with a default of
{@link org.archive.uid.UUIDGenerator}.  The default implementation generates
<a url="http://en.wikipedia.org/wiki/UUID">UUIDs</a> (using java5
<code>java.util.UUID</code>) with an <code>urn</code> scheme using the uuid
namespace [See <a href="http://www.ietf.org/rfc/rfc4122.txt">RFC4122</a>].
</p>

<h3>{@link org.archive.util.anvl ANVL}</h3>
<p>The ANVL RFC822-like format is used writing <code>Named Fields</code> in
WARCs and occasionally for metadata. An implementation was added at
{@link org.archive.util.anvl}.
</p>

<h3>Miscellaneous</h3>
<p>Writing WARCs, the <code>response</code> record type is chosen as the core
record that all others associate to: i.e. all others 
have a <code>Related-Record-ID</code> that points back to the
<code>response</code>.
</p>

<h2><a name="deviations">Deviations from Spec.</a></h2>
<p>The below deviations from spec. 0.9 have been realized in code and are to
be proposed as spec. amendments with new
revision likely to be 0.10 (Vocal assent was given by John, Gordon, and Stack
to the below at <i>La Honda</i> Meeting, August 8th, 2006).</p>

<h3>mimetype in header line</h3>
<p>Allow full mimetypes in the header line as per RFC2045 rather than
current, shriveled mimetype that allows only type and subtype.  This will mean
mimetypes are allowed <i>parameters</i>: e.g.
<code>text/plain; charset=UTF-8</code> or
<code>application/http; msgtype=request</code>.  
Allowing full mimetypes, we can support the following scenarios without
further amendment to specification and without parsers having to resort to
<code>metadata</code> records or to custom
<code>Named Fields</code> to figure how to interpret payload:
<ul>
<li>Consider the case where an archiving organization would store all
related to a capture as one record with a mimetype of 
<code>multipart/mixed; boundary=RECORD-ID</code>.  An example record
might comprise the parts 
<code>Content-Type: application/http; msgtype=request</code>,
<code>Content-Type: application/http; msgtype=response</code>, and
<code>Content-Type: text/xml+rdf</code> (For metadata).
</li>
<li>Or, an archiving institution would store a capture with
<code>multipart/alternatives</code> ranging from
most basic (or 'desiccated' in Kunze-speak)
-- perhaps a <code>text/plain</code> rendition of a PDF capture -- through to
<code>best</code>, the actual PDF binary itself.
</li>
</ul>
</p>
<p>To support full mimetypes, we must allow for whitespace between parameters
and allow that parameter values themselves might include whitespace
('quoted-string'). The WARC Writer converts any embedded carriage-return and
newlines to single space.</p>

<h3>Swap position of recordid and mimetype in the header line</h3>
<p>Because of the above amendment where we allow full mimetypes on header line,
to ease the parse, since miemtype now may include whitespace, we move the
mimetype to last position on header line and recordid to second-from-last.</p>

<h3>Use application/http instead of message/http</h3>
<p>message type has line length maximum of 1000 characters absent a
<code>Content-Type-Encoding</code> header set to <code>BINARY</code>.
(See definition of message/http for talk of adherence to MIME
<code>message</code> line limits: See 
19.1 Internet Media Type message/http and application/http in 
<a href="http://www.faqs.org/rfcs/rfc2616.html">RFC2616</a>).
</p>


<h2>Suggested Spec. Amendments</h2>
<p>Apart from the above listed <a href="#deviations">deviations</a>, the below
changes are also suggested for inclusion in 0.10 spec. revision</p>

<p>Below are mostly suggested edits.  Changes are not substantative.
</p>
<h3>Allow multiple instances of a single Named Parameter</h3>
<p>Allow that there may be multiple instances of same Named Parameter
in any one Named Parameter block.
E.g. Multiple <code>Related-Record-ID</code>s could prove of use.
Spec. mentions this in <i>8.1 HTTP and HTTPS</i> section but better
belongs in the <i>5.2 Named Parameters</i> preamble.
</p>

<p>Related, add to <code>Named Field</code> section note on bidirectional
<code>Related-Record-ID</code>.</p>


<h4>Miscellaneous</h4>
<p>LaHonda in below is reference to meeting of John, Gordon and Stack at
LaHonda Cafe on 16th St., on August 8th, 2006.
</p>
<ul>
<li>Leave off 9.2 GZIP extra fields. Big section on implementing an option
that has little to do with WARCing. AGREED at LaHonda.</li>

<li>But, we need to mark gzipped files as being WARC: i.e. that the 
GZIP is a member per resource. Its useful so readers know how to invoke
GZIP (That it has to be done once to get at any record or just need to
do per record). Suggest adding GZIP extra field in HEAD of
GZIP member that says 'WARC' (ARC has such a thing currently). NOT NECESSARY per LaHonda meeting.</li>

<li>IP-Address for dns resource is DNS Server.  Add note to this effect in
8.2 DNS.</li>

<li>Section 6. is truncated -- missing text.  What was intended here? SEE
ISO DOC.</li>

<li>In-line ANVL definition (From Kunze).  Related, can labels have
CTLs such as CRLF (Shouldn't)?  When says 'control-chars', does this include
UNICODE control characters (Should)? CHAR is described as ASCII/UTF-8 but they
are not same (Should be UTF-8).  ANVL OR NOT STILL UP IN AIR AFTER LaHonda.
Postpone to 0.11 revision.</li>

<li>Fix examples. Use output of experimental ARC Writer.</li>

<li>Fix ambiguity in spec. pertaining to 'smallest possible anvl-fields' notcited by Mads Alhof Kristiansen in <a 
href="ftp://ftp.diku.dk/diku/semantics/papers/D-548.pdf">Digital Preservation
using the WARC File Format</a>.
</li>


</ul>

<h2>Open Issues</h2>
<h3>Drop response record type</h3>
<p><code>resource</code> is sufficent. Let mimetype distingush if capture with
response headers or not (As per comment at end of <i>8.1 HTTP and HTTPS</i>
where it allows that if no response headers, use resource record type and
page mimetype rather than response type plus a mimetype of message/http: The
difference in record types is not needed distingushing between the two
types of capture)</p>
<p>Are there other capture methods that would require a response record,
that don't have a mimetype that includes response headers and content?
SMTP has rich MIME set to describe responses. Its request is
pretty much unrecordable. NNTP and FTP similar.  Because of rich MIME, no
need of a special response type here.</p>
<p>Related, do we need the <code>request</code> record?
Only makes sense for HTTP?</p>

<p>This proposal is contentious.  Gordon drew scenario where response
would be needed distingushing local from remote capture if an archiving
institution purposefully archived without recording headers or
if the payload itself was an archived record. In opposition, was suggested that
should an institution choose to cature in this 'unusual' mode, crawl metadata
could be used consulted to disambiguate confusion on how capture was done (To
be further investigated.  In general, definition of record types is still in 
need of work).
</p>
<h3>subject-url</h3>
<p>The ISO revision suggests that the positional parameter 
<code>subject-uri</code> be renamed.  Suggest <code>record-url<code>.</p>

<h3>Other issues</h3>
<ul>
<li>Should we allow freeform creation of custom Named Fields if
have a MIME-like 'X-' or somesuch prefix?
</li>

<li>Nothing on header-line encoding (Section 11 says UTF-8). 
For completeness should be US-ASCII or UTF-8, no control-chars (especially
CR or LF), etc.</li>

<li><code>warcinfo</code>
<ul><li>What for a scheme?  Using UUID as per G suggestion.</li>


<li>Also, how to populate description of crawl into warcinfo?
'Documentation' <code>Named Field</code> with list of URLs that can be assumed
to exist somewhere in the current WARC set (We'd have to make the crawler go
get them at start of a crawl).
</li>

<li>I don't want to repeat crawl description for every WARC. How to have this
warcinfo point at an original?  <code>related-record-id</code> seems
insufficent.
</li>

<li>If the crawler config. changes, can I just write a warcinfo with
differences?  How to express?  Or better as metadata about a warcinfo?
</li>
<li>In the past
we used to get the filename from this URL header field when we unsure of the
filename or it was unavailable (We're reading a Stream).  Won't be able to do
that with UUID for URL.  So, introducing new warcinfo Named Field (optional)
'Filename' that will be used when warcinfo is put at start of a file.
Allow warcinfo to have a named parameter 'Filename'?
</li>

</ul>
</li>

<li><code>revisit</code>
<ul>
<li>What to write?  Use a description field or just expect this info 
to be present in the warcinfo? Example has request header
(inside XML).  Better to use associated <code>request</code> record for this
kind of info?</li>

<li><code>Related-Record-ID</code> (RRID) of original is likely
an onerous requirement. Envisioning an implementation where we'd write
<code>revisit</code> records, we'd write such a record where content was
judged same or where date since last fetch had not changed.  If we're to
write the RRID, then we'd have to maintain table keyed by URL with value of
page hash or of last modified-date plus associated RRID (actual RRID
URL, not a hash).</li>

</ul>
</li>

<li>Should we allow a <code>Description</code> <code>Named Field</code>.
E.g. I add an order file as a metadata record and associate with a
<code>warcinfo</code> record.  Description field could say "This is Heritrix
Order file".  Same for seeds.  Alternative is custom XML packaging (Scheme
could describe fields such as 'order' file or ANVL packaging using ANVL
'comments'.
</li>

<li>Section 11, why was it we said we don't need a parameter or explicit
subtype for special gzip WARC format?  I don't remember?   Reader needs to
know when its reading a stream.  A client would like to know so it wrote
stream to disk with right suffix?  Recap. (Perhaps it was looking at
the MAGIC bytes -- if it starts with GZIP MAGIC and includes extra fields
that denote it WARC, thats sufficent?).</li>

<li>Section 7, on truncation, on 7.1, suggest values -- 'time', 'length' --
but allow free form description?
Leave off 'superior method of indicating truncation' paragraph.  This qualifier
could be added to all sections of doc -- that a subsequent revision of any 
aspect of the doc. will be superior. 
Rather than <code>End-Length</code>, like MIME, last record could have
<code>Segment-Number-Total</code>, a count of all segments that make up
complete record.
</li>
</ul>

<p>From LaHonda, discussion of <code>revisit</code> type. Definition was
tighted some by saying revisit is used when you chose not to store the capture.
Was thought possible that it
NOT require pointer back to an original.  Suggested it might have a
similarity judgment header -- <code>similiarity-value</code> -- with values
between 0 and 1.  Might also have <code>analysis-method</code> and
<code>description</code>.  Possible methods discussed included: URI same,
length same, hash of content same, judgement based off content of HTTP HEAD
request, etc.  Possible payloads might be: Nothing, a diff, the hash obtained,
etc.
</p>

<h2>Unimplemented</h2>
<ul>
<li>Record Segmentation (4.8 <code>continuation</code> record type
and the 5.2 <code>Segment-*</code> Named Parameters.  Future TODO.</li>
<li>4.7 <code>conversion</code> type. Future TODO.</li>
</ul>

<h2>TODOs</h2>
<ul>
<li>unit tests using <code>multipart/*</code> (JavaMail) reading and
writing records? Try <code>record-id</code> as part boundary.
</li>
<li>Performance: Need to add Record-based buffering.  GZIP'd streams
have some buffering because of the deflater but could probably do
w/ more.</li>
</ul>

</body>
</html>
